Basics: what are observations and forecasts?

First, we need to stop thinking of data as “points”. They are not the outcome of a deterministic process; they include randomness. If the data are random, any forecast has to depict this randomness.

This is a more honest representation of the data, and it already gives us a region for the forecast. These distributions depict the probability of where the next observation will be. The expectation is that it will be where the probability is at its maximum (the centre of the normal distribution), but real data have other plans.

In this simple case what should our forecast be? Let’s think of a few cases. Suppose we forecast in the middle (where the probability is maximised - called “expectation” in statistics), above it, or below it.

When will the forecast error be smaller? We cannot answer this for a future unknown observation, but we can answer it on average for “all” future observations. Let’s count the errors if we repeated the process 1000 times, i.e., forecasting in the middle, above, and below. (Most of the fancy statistics is just that: throw the ball against the wall infinite times and see what happens.)

Looking at raw errors rarely helps. There are two bits of information that we are really interested in. First, the location of the errors, called bias. Am I over- or under-forecasting on average? I want the bias to be as close to zero as possible. This is simply the mean of the errors (the red dots in the boxplots). Second, the variance, which we often understand as accuracy. This is often measured by the root mean squared error. We square the errors because we do not care whether we over- or under-forecast (removing the +/-), and take the square root to return everything to the same scale. Recall from your favourite subject at school (obviously I’m talking about physics) that we need to carry the units with our equations. If the time series is in SEK, the forecast is in SEK and the errors are in SEK, but the mean squared errors (MSE) are in SEK^2. The root (RMSE) is again in SEK.
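To make the two measures concrete, here is a minimal sketch (in Python, though these notes otherwise use R) with made-up numbers: 1000 hypothetical future observations drawn from a normal distribution, and the bias and RMSE of a fixed forecast placed at the middle, below, and above it.

```python
import math
import random
import statistics

random.seed(1)
# Hypothetical setup: future observations ~ Normal(100, 10), in SEK
actuals = [random.gauss(100, 10) for _ in range(1000)]

def bias(forecast):
    """Mean error (actual - forecast): the location of the errors, in SEK."""
    return statistics.mean(a - forecast for a in actuals)

def rmse(forecast):
    """Root mean squared error: square, average, then root back to SEK."""
    return math.sqrt(statistics.mean((a - forecast) ** 2 for a in actuals))

# Forecasting at the middle (the expectation) gives a bias near zero and the
# smallest RMSE; forecasting above or below inflates both.
for f in (90, 100, 110):
    print(f, round(bias(f), 2), round(rmse(f), 2))
```

Forecasting at 100 should show a bias near zero and an RMSE near 10; the forecasts at 90 and 110 pick up a bias of roughly +/-10 and a visibly larger RMSE.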

What am I trying to get to?

Observe two things:

  • The boxplot of the errors (not the squared errors) already gives us the interval for the forecast. It really says: your forecast is where you thought it would be, +/- these errors.
  • The squared errors are just the variance, so a way to quantify that interval is the variance.

And finally: the formula for the variance of the errors is 1/n sum((e - mean(e))^2). The MSE (mean squared error) is 1/n sum(e^2). The only difference is that mean(e) is not removed in the case of the MSE. Why? Because the “correct” forecast is the unbiased forecast: mean(error) = 0, in which case the two coincide. This will have some very useful implications later on.
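The connection can be checked numerically; here is a small Python sketch (illustrative numbers) showing that MSE = variance of the errors + bias^2, so the two coincide exactly when the bias is zero:

```python
import random
import statistics

random.seed(2)
# Deliberately biased errors: mean 0.5 instead of 0
errors = [random.gauss(0.5, 2.0) for _ in range(100_000)]

mse = statistics.mean(e ** 2 for e in errors)        # 1/n sum(e^2)
b = statistics.mean(errors)                          # bias: mean(e)
var = statistics.mean((e - b) ** 2 for e in errors)  # 1/n sum((e - mean(e))^2)

# The decomposition MSE = variance + bias^2 holds exactly
print(round(mse, 4), round(var + b ** 2, 4))
```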

Where does uncertainty come from?

It is all the information we do not know, plus our misunderstandings about the system we are trying to model. Suppose you’re running an ice cream store and you have 5 customers, each of whom may buy from 0 to 4 ice creams with equal probability (so, 5 options, each with 1/5 = 20% probability). The middle of 0 to 4 is (0+4)/2 = 2. You have 5 customers, so a simple forecast is 2*5 = 10. Here are your sales for a single day.

For 20 days

What do we see here? The probability of purchase for each customer starts to get closer to 20% for each of the 0 to 4 ice cream options. On the aggregate level (your sales) the forecast of 10 is sometimes spot on, sometimes under- and sometimes over-forecasting.

For 1000 days we get

The sales end up following a normal distribution with a mean of 10, what we expected from our simplistic calculation. Why is it a normal distribution? This is what the Central Limit Theorem tells us to expect: the distribution of sums eventually becomes normal. So we do not know why each customer buys the number of ice creams they buy, but when we examine the aggregate sales, the problem is much simpler to model. It is just a normal distribution. The forecast is its mean, and its uncertainty is its variance. When we “modelled” the sales we did not need any knowledge of how consumers behave. The sample mean and variance are enough.
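The ice cream experiment is easy to replay; a quick sketch (in Python, while the notes otherwise use R):

```python
import random
import statistics

random.seed(3)

def daily_sales():
    """Five customers, each buying 0-4 ice creams with equal (20%) probability."""
    return sum(random.randint(0, 4) for _ in range(5))

sales = [daily_sales() for _ in range(1000)]

# The aggregate is approximately normal around the simple forecast of 10
print(round(statistics.mean(sales), 2), round(statistics.stdev(sales), 2))
```

Over 1000 simulated days the sample mean lands very close to 10, even though we never modelled any individual customer.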

This is core in any predictive modelling. We do not need to know too many details. Our task is simply to estimate a mean and a variance. The mean here was very easy to estimate because it has no structure/dynamics: it does not change over time. If it did (for instance, people who do not know the real joy in life consume ice cream seasonally, waiting for warm weather - insane!), then the mean is more complex to calculate. The same logic applies to the variance.

We need a term for that: conditional mean and conditional variance. Conditional on what? On all information that we have up to this point. So that will typically be time and any other explanatory variables we have.

A more realistic example (for when we reach ARIMAs)!

Here is a forecast:

This forecast is incomplete. It only provides us with the conditional mean for the future months. Note how the forecast follows the seasonal pattern. This is not the unconditional mean; it is conditional on the time period. For example, the mean is different for January and for June.

Here is the forecast against what happened - I zoom in on the last part.

We should never expect the forecast to be spot on what will happen. Remember, the noise is unforecastable; it is the information we cannot know. Even if here the forecast is pretty close, it still remains a lucky guess, and we do not have any insight into how confident the model is in its forecast.

Here is the complete picture with the point forecast (the conditional mean) and its uncertainty (which depends on the conditional variance). The uncertainty interval (technically, the prediction interval) contains the actuals. So it is not about the point prediction (red line) matching the actuals, but rather about the actuals being contained in the intervals. Even if we had perfect knowledge and our model were perfect, we would still not be able to get the actuals spot on. Why? Because noise is unforecastable.

If we look carefully at the plot, the black line, what eventually happened, is not always contained in the intervals. The plotted intervals at their widest are the so-called 95% prediction intervals. This means that we expect 1 out of 20 times (5% probability) the actual data to be outside the intervals. This is assuming that everything is calculated correctly (our conditional means and variances are good approximations of what is happening with the data). The more accurate a model is, the narrower the intervals, and the more fairly they will cover the probabilities they claim to.
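Under the normal-noise assumption, the 95% interval is just the conditional mean plus/minus about 1.96 conditional standard deviations; a Python sketch with hypothetical values:

```python
from statistics import NormalDist

# Hypothetical conditional mean and standard deviation (cf. the ice cream example)
cond_mean, cond_sd = 10.0, 3.16

z = NormalDist().inv_cdf(0.975)  # ~1.96: leaves 2.5% probability in each tail
lo, hi = cond_mean - z * cond_sd, cond_mean + z * cond_sd
print(round(lo, 2), round(hi, 2))
```

One actual in twenty is expected to fall outside [lo, hi] even when the conditional mean and variance are exactly right.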

As an example, here is a terrible forecast

In principle, this is still a usable forecast, but its uncertainty is very high. If we translated this forecast into a decision, we would need to account for a ton of risk. For example, with the good forecast, given the uncertainty, we should plan for +/- one more trip to carry all the passengers. With the bad forecast we would need to plan for +/- hundreds of trips to account for the uncertainty in the passenger forecast. Planning for all this contingency is too expensive!

Note that the prediction intervals expand over time. We will see what is happening there in detail when we reach ARIMA models, but simply put, we accumulate uncertainty from each previous forecast period. Statistically, this sentence says: your forecast errors are correlated, and this correlation inflates the uncertainty. This sentence will become increasingly annoying as we understand more of the mathematical side of modelling.

Now we are ready for regression!

Regression!

Regression is the most fundamental tool we have in statistics. A massive class of modelling problems is a regression. Most of the stuff you will see in your stats related modules or the implicit models in management theories/questions will be a regression. From simple statistical tests, to complex language models, it is just a regression with more or fewer tricks. If you master regression, everything becomes very very easy.

Regression, at its basics, tries to tell you how strong the observed connection between variables is. It claims nothing about the causal structure of the target variable. This is a very common mistake in how people use regression: they find a connection, and then overreach by claiming that some causality is uncovered. This is not what regression does, and it is epistemologically flawed to make these claims. (A ton of management, economics, biology, medicine, and psychology papers are unfortunately just flawed, by misunderstanding or overclaiming the regression output. This is not a rant, it is more of a: be critical when people suggest that a model says something - they often just try to dress a narrative with numbers.)

Bivariate linear regression

The model is: \[ y_t = \alpha + \beta x_t + \varepsilon_t\] where \(y_t\) is the target in period \(t\), \(\alpha\) is a constant (sometimes called intercept), \(\beta\) is the coefficient of the independent variable (or regressor or covariate) \(x_t\), and \(\varepsilon_t\) is the innovation term. Innovation term is a funny way to say noise. Strictly speaking, the innovation term is the novel information that arrives because of the stochastic nature of the data (e.g., how many ice creams I am going to eat today, from the seller’s perspective). The innovation term is not the same as the residuals or the error. I will return to this point later. For now think of \(\varepsilon_t\) as the noise. Why is it in the equation? This is what will become our prediction intervals. It will not affect the point prediction (the expectation).

The bivariate regression is linear - what does this mean? As the regressor changes, the left hand side, \(y_t\), will change additively, as the coefficients suggest. Let me illustrate the point a bit better. This is linear: \[ y_t = \alpha + \beta x_t + \varepsilon_t\] This is linear as well: \[ y_t = \alpha + \beta x_t^2 + \varepsilon_t\] The target changes by \(\beta\) times whatever value \(x_t^2\) takes; the change is defined by \(\beta\) and it is additive: \(+\beta (x_t^2)\). The model does not know there is an \(x_t^2\). I could just as well write \(z_t = x_t^2\) and \(y_t = \alpha + \beta z_t + \varepsilon_t\). These different write-ups of the model are equivalent and all linear. Note how the coefficients interact with the data:

\[y_t = \alpha + \beta \times something + \varepsilon_t\] What “something” is, is none of the model’s business. Here is a nonlinear example for contrast: \[ y_t = \alpha + x_t^\beta + \varepsilon_t\] Notice where the coefficient \(\beta\) is: if \(x_t\) changes by 1, the change in \(y_t\) is no longer additive by \(\beta \times 1\); it is some exponential. Why am I making all this discussion? People like to say linear models are straight lines, but they are not necessarily. The relationship between \(y_t\) and \(x_t\) may be nonlinear. The regression remains linear because it is a linear combination of its parameters (a linear combination of \(\alpha \times 1\) and \(\beta \times x^{whatever}\)). We will need this later, but also, in case you plot a regression and it does not look like a straight line, this is why.
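The \(z_t = x_t^2\) trick can be sketched directly (Python, simulated data): fit \(y_t = \alpha + \beta x_t^2 + \varepsilon_t\) by ordinary least squares on the transformed regressor, and the usual linear-regression formulas still apply.

```python
import random
import statistics

random.seed(4)
x = [i / 10 for i in range(1, 51)]
# Simulated data from the true model y = 1 + 2*x^2 + noise
y = [1.0 + 2.0 * xi ** 2 + random.gauss(0, 0.5) for xi in x]

# The model stays linear in (alpha, beta): regress y on z = x^2
z = [xi ** 2 for xi in x]
zbar, ybar = statistics.mean(z), statistics.mean(y)
beta = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / \
       sum((zi - zbar) ** 2 for zi in z)
alpha = ybar - beta * zbar
print(round(alpha, 2), round(beta, 2))  # close to the true (1, 2)
```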

Notation tip: When we refer to some unknown (often implying the true parameter that generates the data) we use Greek letters. When we refer to estimated parameters we either use “hats” or English letters: \(\hat{\beta}\) or \(b\) and quite often people use both just in case \(\hat{b}\)!

You have probably heard that correlation is not causation. You can also add to that: regression is not causation. We want to make sure relationships are linear (or look linear-ish) before we start building our regression. Now here is where textbooks and many people get a bit confused. People say: first plot the data - I say: Meh, good luck spotting linearity with that! Here is why:

Both are linear regressions. The model has to be linear in the coefficients, not in how \(x\) enters (here through \(x^2\)). We can check whether there is leftover nonlinearity in the residuals of the regression. Below are the residuals \(e_t = y_t - \hat{y}_t\) of the previous regressions.

I plot the dependent variable versus the errors. If there was some nonlinear information left over beyond the \(\hat{\beta} \times\) regressor part, it would show up here. For example, in the model \(y_t = \alpha + \beta x_t^2 + \varepsilon_t\), let’s ignore the quadratic (\(x^2\)) and fit it with just \(x\):

There is a clear nonlinearity left in the residuals. This hints to us that our model is not great, and we should try to build a better regression. (We could start by transforming \(x\), but there is nothing stopping us from transforming \(y\).)
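The residual check can be sketched as follows (Python, simulated quadratic data): fit a straight line to curved data and look for leftover structure in the residuals, here via their covariance with \(x^2\).

```python
import random
import statistics

random.seed(5)
x = [i / 10 for i in range(1, 51)]
y = [1.0 + 2.0 * xi ** 2 + random.gauss(0, 0.5) for xi in x]  # curved data

# Misspecified model: straight line y = a + b*x
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Leftover curvature: the residuals still co-move with x^2
z = [xi ** 2 for xi in x]
zbar, rbar = statistics.mean(z), statistics.mean(resid)
cov = statistics.mean((zi - zbar) * (ri - rbar) for zi, ri in zip(z, resid))
print(round(cov, 2))  # clearly positive: nonlinearity left in the residuals
```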

Do not take this as: we do not need to plot the data! We still need to plot the data to get an idea of what we are dealing with, and spot outliers and structural breaks. We can spot all these from the residuals, but it is often faster to start from just plotting our data.

We have three regressions from the examples above. Let’s see what the estimated models tell us:

## 
## Call:
## lm(formula = y1 ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3904 -2.8682  0.6303  1.5356  7.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2881     1.2340   1.854 0.096694 .  
## x             2.5188     0.3902   6.455 0.000117 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.093 on 9 degrees of freedom
## Multiple R-squared:  0.8224, Adjusted R-squared:  0.8026 
## F-statistic: 41.67 on 1 and 9 DF,  p-value: 0.0001175

First, the residuals rows: very uninformative! Just plot them. The coefficients part is where we get the most interesting information. Let’s understand what we are told there. We pick the line of \(x\) first.

  • The coefficient is \(\hat{\beta}_{x} = 2.52\). The real information here is that it is positive. The size of the coefficient can be misleading. We return to this when we look at multiple regression.
  • The standard error is \(\hat{\sigma}_{\beta_x} = 0.39\). What does this mean? This standard error connects to the idea of standard deviation. Standard deviation of what? Of the errors in estimating the coefficient. Imagine I could ask for a different set of \(y_t\) (and \(x_t\)) values from the same data generating process. Because of the noise \(\varepsilon_t\), the estimates will be somewhat different. The more the noise, or the less the data, the bigger the differences. If I collected all these estimates I would get a distribution of \(\hat{\beta}_{x}\), and that distribution has some standard deviation. This is what the standard error is. Much like the discussion about forecasts before, the narrower this is, the more precise the estimate and the less uncertainty is associated with it. The calculation of the standard error is in fact rather different, based on analytical formulas, but the intuition is what we have here.
  • If the noise in the data can change the values of the coefficients, then a reasonable question is: how do we know that the coefficient is correctly estimated? In general we do not know the correct value, so we cannot meaningfully calculate \(\beta - \hat{\beta}\), which is what we would really like. Instead, we can at least ask: do we have enough evidence that \(\beta \neq 0\)? We want to know that we did not pick up a spurious \(\beta\) by overinterpreting the noise in the data. We will do a t-test here. The logic is: I have the value of \(| \hat{\beta} - 0 |\); is this large enough, given its estimation uncertainty, to believe that \(\beta\) exists? (Exists means \(\beta \neq 0\).) Before, I said the size of the coefficient can be misleading, so we want to normalise it by something. We normalise it by the size of its estimation errors. Why? Remember the argument about units before. The size of the coefficient has some units (it is really the scale we are interested in; the coefficient strictly speaking is unitless), and the same is true for the size of the standard errors. If we divide these two we get a clean, scale-free number. This is what the t-statistic (t value above) is: \((\hat{\beta} - 0)/\hat{\sigma}_{\beta_x} = 6.46\). What do we do with that number? It is a t-test, so we look up what this value corresponds to on a Student’s t distribution and get the p-value. Heuristically, absolute values above 2 correspond to about 5%. What is this percentage telling us? Here the p-value is 0.000117. This is the probability of observing an estimate this far from zero due to sampling uncertainty alone, if the true \(\beta\) were 0. It tells us nothing about whether \(x\) should be in the model or not! When the p-value is large, we cannot simply say “so \(x\) has no impact”. We can only say that we wasted our time (and data) estimating \(\hat{\beta}\) for this sample of \(y_t\). Visually, what we are saying is:

There is only a tiny chance for the estimated coefficient to be zero (the probability that the p-value gives).
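Sketching the arithmetic with the numbers from the output above (Python; the normal CDF is only an approximation of the t distribution with 9 degrees of freedom, but close enough for the intuition):

```python
from statistics import NormalDist

beta_hat, se = 2.5188, 0.3902  # Estimate and Std. Error from the lm() output
t = (beta_hat - 0) / se        # distance from zero in standard-error units
print(round(t, 3))             # 6.455, matching the "t value" column

# Heuristic: |t| > 2 roughly corresponds to the 5% level (1.96 under a normal)
p_approx = 2 * (1 - NormalDist().cdf(t))  # two-sided normal approximation
print(p_approx < 0.001)
```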

For the sake of clarity, let’s repeat the same with the third regression where we fit a straight line on the \(x^2\) data.

Here we have serious doubts that the estimation of the coefficient is reliable. (Strictly speaking, the visualisation here is a bit misleading, as the test is two-sided and not one-sided, and I plotted the whole p-value on one side. This is because I changed the perspective of the test from 0, as it assumes \(\beta = 0\), to being centred around the estimated coefficient.)

Back to the previous output:

## 
## Call:
## lm(formula = y1 ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3904 -2.8682  0.6303  1.5356  7.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2881     1.2340   1.854 0.096694 .  
## x             2.5188     0.3902   6.455 0.000117 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.093 on 9 degrees of freedom
## Multiple R-squared:  0.8224, Adjusted R-squared:  0.8026 
## F-statistic: 41.67 on 1 and 9 DF,  p-value: 0.0001175

What is a good p-value to test against? If we do our science correctly, there is none! We have to calculate the correct p-value for our experiment (what are we testing, how many participants, etc.). This is very rarely done, so people have kept a dogmatic adherence to 5%. Why 5%? Why not 5.1%? Because people like crisp rules, but there is none :)

Personally I ignore marginal p-values. More importantly, p-values do not tell us anything about the variable having any causal connection with the target; they only make a statement about our estimation. When we have an experimental design built on a theory-driven hypothesis, this may be valid. When we do data mining, i.e., we try to build a bunch of models to figure out what is going on with our data, then we have no theory-driven hypothesis, and the p-values do not allow us to conclude that variables should not be in the model. The statistical rationale for saying we can remove a variable is very different. Every coefficient we estimate uses data. If we know a coefficient is not well estimated, then we throw it away and re-estimate the model with “more available data”. Personally, I very rarely look at p-values, and I always teach not to use them at all. The chances of misusing them are much higher than getting anything useful out of them. They also have a fallacy in their maths, but I will return to that later.

Recall that predictions are uncertain. The width of the uncertainty is based on a standard deviation or variance calculation. For regression models this is the residual standard error.

I mentioned above that any coefficient we estimate uses up data. This is what the degrees of freedom are trying to tell us. The lower the number, the less reliable the estimation is. Here the degrees of freedom are 9. I used 11 data points, spent one on the constant and one on the coefficient of \(x\), and I am left with 9 degrees of freedom. In such simple models there is little more to say about this here. Unless we go into the analytical side of modelling, this is probably the last you will hear about this.

  • Multiple R-squared: it says what percentage of your data is explained by your model. The remainder, up to 100%, is currently left in the errors of the model. Often people try to maximise their \(R^2\) - this is very wrong! Let us understand why. As we showed in the first plots, every observation = structure + noise. How much of it is noise is unknown, but we know that our data have noise. If the \(R^2\) becomes 100%, that means the model has explained the observations fully, which means it has explained the noise. But recall that the noise is the information we cannot recover, so it is impossible to predict or model it. If we try to tweak our regression to maximise \(R^2\), all we are doing is overfitting to the noise in the data and getting a silly model as an output. That model will try to second-guess randomness, rather than find a reliable conditional expectation. In our ice cream example, that would be the same as moving your forecast around from 0 to 20 every day, just because. The \(R^2\) can only tell us how much of the data remains unexplained, so that we do not overclaim with our model. It cannot be used as a statistic to optimise during model building.

  • Adjusted R-Squared: This is obsolete, so we have stopped using it. It is there only to confuse colleagues who are forced to teach statistics without having had the chance to study it, and to really confuse students. Some people will claim that this is useful for multiple regression. When we reach there, we will see why this is not the case.

  • F-statistic: This is like a p-value for the whole regression. Is the whole regression, all the coefficients together, well estimated? Just as using p-values for individual coefficients is misleading, so is using the F-statistic in the contexts in which we use regression with business data.
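For reference, the \(R^2\) in the output is just one minus the ratio of unexplained to total variation; a tiny Python sketch with made-up numbers:

```python
import statistics

# Hypothetical actuals and fitted values from some model
y = [12, 15, 11, 14, 18, 13]
yhat = [11, 14, 12, 15, 17, 13]

ybar = statistics.mean(y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # unexplained variation
ss_tot = sum((yi - ybar) ** 2 for yi in y)               # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # share of the variation "explained" by the model
```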

Important point: Some textbooks claim that there is a distinction between descriptive and predictive models. They make this claim to get the “okay” to abuse p-values and statistical tests. There is very solid science that demonstrates this is misguided. The only test of validity of a model is a predictive test. If your model cannot predict skillfully, then it is useless, no matter how many p-values indicate thumbs-up! I will return to this point when we reach multiple regression, to give you some solid examples of this fallacy. (Epistemologically speaking, you cannot validate a theory or a statement by just playing around with your sample of data. You either demonstrate the veracity of a theory by its ability to predict, or you demonstrate that your deductive hypotheses cannot be rejected in the face of sufficient evidence. The second option is very difficult to do in practice, because of the stringent requirements it places on what counts as a theory and a hypothesis. Someone enjoying a glass of cognac and sharing their ideas is surely not a theory! That is a good time! No p-value can make that a theory.)

You may have noticed that I did not make a big deal about the constant. I will touch on this when we detail multiple regression.

Multiple regression

This is the same story but with more explanatory variables. The easy way to write it is: \[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \ldots + \beta_p x_{p,t} + \varepsilon_t\] Like before, the \(x\)’s can be nonlinear or whatever. The regression is linear in its coefficients. If you are familiar with linear algebra, multiple regression is typically written as: \[y = Xb + \varepsilon\] where \(y\) is a vector of all \(y_t\), \(X\) is a matrix containing all the different \(x_t\)’s as columns, and \(b\) is a vector containing all coefficients. The only reason I bring this up here is that I want you to observe that the constant is nothing special (in \(X\) it is typically a column of 1’s) and that in this formulation \(Xb\) will always show its linear nature and won’t be masked by nonlinear \(x\)’s.
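To see the “nothing special” constant concretely, here is how the design matrix \(X\) looks for a toy dataset (Python sketch, made-up numbers):

```python
# Three observations of two regressors
x1 = [1.2, 0.7, 3.1]
x2 = [4.0, 2.5, 0.9]

# The constant is just another column: a column of 1's in X
X = [[1.0, a, b] for a, b in zip(x1, x2)]
for row in X:
    print(row)  # each row: [1, x1_t, x2_t]
```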

Let’s take a simple case: \[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \varepsilon_t\] How does this look if we want to plot it? There are three variables \((y,x_1,x_2)\), so we need three dimensions. Bivariate regression is typically thought of as a line (although we showed that this is not necessarily so); a multiple regression with two \(x\)’s is a plane.

And the model output now looks like:

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3445 -1.1375 -0.6377  1.1756  3.3059 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.6776     1.2156   4.671 0.000219 ***
## x1            0.7853     0.1714   4.582 0.000265 ***
## x2            0.3982     0.1893   2.104 0.050613 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.912 on 17 degrees of freedom
## Multiple R-squared:  0.5535, Adjusted R-squared:  0.501 
## F-statistic: 10.54 on 2 and 17 DF,  p-value: 0.001055

A question that will arise is: do I need both \(x_1\) and \(x_2\)? If we went with the p-value logic, \(x_2\) would be rejected, but as covered above, this is not valid. In fact, the tools to decide whether or not to keep all variables in the model are not provided by this table.

Before we go there, let’s trick our statistics. We will focus on \(R^2\). We will trick it by introducing random explanatory variables. Random, as in generated completely randomly from a uniform distribution with some fake noise. Below is the result with 10 additional random variables.

## 
## Call:
## lm(formula = y ~ ., data = X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3478 -0.6777 -0.1192  0.5159  3.2308 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.25961    4.38666   1.199   0.2695  
## x1           0.83470    0.27388   3.048   0.0186 *
## x2           0.36728    0.30017   1.224   0.2607  
## x3          -0.51736    0.40547  -1.276   0.2427  
## x4           0.03224    0.48962   0.066   0.9493  
## x5           0.57373    0.52746   1.088   0.3127  
## x6          -0.04713    0.43663  -0.108   0.9171  
## x7           0.12044    0.36642   0.329   0.7520  
## x8          -0.11855    0.30843  -0.384   0.7121  
## x9          -0.07150    0.25958  -0.275   0.7909  
## x10          0.19868    0.30709   0.647   0.5383  
## x11         -0.14188    0.25851  -0.549   0.6002  
## x12          0.23429    0.40478   0.579   0.5809  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.207 on 7 degrees of freedom
## Multiple R-squared:  0.755,  Adjusted R-squared:  0.3351 
## F-statistic: 1.798 on 12 and 7 DF,  p-value: 0.2227

\(R^2\) jumped upwards! Remember, these are just random numbers; there is no way they could explain the data in any meaningful way. Let’s increase further! Below is the result with our 2 initial variables and 16 random ones. (Why 16? I have 12 variables + intercept in the model = 13 parameters, and 20 data points - 13 parameters leaves me with 7 degrees of freedom. If I go to 16 additional variables I will have 16 random variables + 2 (x1 and x2) + 1 (intercept) = 19. This will leave me only one degree of freedom; beyond that it is no longer statistics!) Also, compare the p-values for the intercept, x1, and x2. \(R^2\) is improving, but the p-values suggest our previously important variables are becoming weaker.

## 
## Call:
## lm(formula = y ~ ., data = X)
## 
## Residuals:
##         1         2         3         4         5         6         7         8 
## -1.673519  0.380125 -1.596578  0.732222  0.918447  0.005976 -0.057238 -0.340604 
##         9        10        11        12        13        14        15        16 
##  0.342207  0.267987 -0.150021  0.370627  0.244341  0.181622  0.949684 -0.402707 
##        17        18        19        20 
##  0.221750 -0.556938 -0.171953  0.334571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.826137  18.062941   1.374    0.400
## x1          -0.325269   1.066405  -0.305    0.812
## x2           0.155925   0.514392   0.303    0.813
## x3          -0.992714   0.831739  -1.194    0.444
## x4           0.052733   1.052990   0.050    0.968
## x5          -1.811241   1.348342  -1.343    0.407
## x6          -0.449697   1.270616  -0.354    0.783
## x7          -0.430461   0.854260  -0.504    0.703
## x8          -0.797673   0.958895  -0.832    0.558
## x9           0.381827   0.737000   0.518    0.696
## x10         -0.231894   1.671322  -0.139    0.912
## x11         -0.477259   0.996282  -0.479    0.716
## x12          0.899067   0.874075   1.029    0.491
## x13          0.309278   1.101247   0.281    0.826
## x14          0.005495   0.541918   0.010    0.994
## x15         -0.742389   0.750013  -0.990    0.503
## x16         -1.042778   0.808949  -1.289    0.420
## x17         -0.265788   0.520764  -0.510    0.700
## x18          0.112755   0.818649   0.138    0.913
## 
## Residual standard error: 3 on 1 degrees of freedom
## Multiple R-squared:  0.9354, Adjusted R-squared:  -0.2283 
## F-statistic: 0.8038 on 18 and 1 DF,  p-value: 0.7206

Almost perfect \(R^2\)! Just for the sake of the argument, let’s use up all degrees of freedom (data) with 17 random variables.

## 
## Call:
## lm(formula = y ~ ., data = X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.299398        NaN     NaN      NaN
## x1           1.144130        NaN     NaN      NaN
## x2           0.156894        NaN     NaN      NaN
## x3           1.571057        NaN     NaN      NaN
## x4          -0.460222        NaN     NaN      NaN
## x5          -0.175609        NaN     NaN      NaN
## x6           1.047185        NaN     NaN      NaN
## x7           0.672103        NaN     NaN      NaN
## x8          -0.004348        NaN     NaN      NaN
## x9           1.455200        NaN     NaN      NaN
## x10         -1.414761        NaN     NaN      NaN
## x11          0.728556        NaN     NaN      NaN
## x12         -0.948662        NaN     NaN      NaN
## x13         -0.723780        NaN     NaN      NaN
## x14          0.807005        NaN     NaN      NaN
## x15          0.274932        NaN     NaN      NaN
## x16          0.665136        NaN     NaN      NaN
## x17          0.510357        NaN     NaN      NaN
## x18         -0.729290        NaN     NaN      NaN
## x19         -1.141317        NaN     NaN      NaN
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA

This is a bit more profound a result than this table suggests. I have enough coefficients to match the number of data points (my degrees of freedom are 0). In this case we just solve a system with 20 values and 20 unknowns, as if you were doing some basic algebra. For this case there is only one solution. If I used even more variables I would have a parametric solution: you would need to give me some values for the unknowns (the coefficients) before I could estimate anything. You probably learned that regression is the unique line that minimises the errors, so why is that unique line any different from the unique line with zero degrees of freedom?
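A miniature version of this zero-degrees-of-freedom “fit” (Python sketch): two data points and two parameters, so the line passes exactly through both and the residuals are zero by construction.

```python
# Two points, two unknowns (intercept and slope): basic algebra, not statistics
pts = [(1.0, 3.0), (4.0, 9.0)]
(x1, y1), (x2, y2) = pts

b = (y2 - y1) / (x2 - x1)  # slope
a = y1 - b * x1            # intercept
residuals = [y - (a + b * x) for x, y in pts]
print(residuals)  # [0.0, 0.0]: the "model" has explained the noise too
```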

Let’s use some plots of the actuals and the fitted regression lines, to see what is happening.

To correctly interpret what is going on here we need to keep in mind two things. First, data = structure + noise, and noise is unforecastable. Second, the more coefficients I give to a model, the more I can “curve” a line. Let’s start from the second point. We said that a regression with two variables is a plane, not a line. A plane can rotate in more ways than a line; it has one more axis of rotation. If I go into 4 dimensions (3 variables) I gain one more axis of rotation, and so on. This means that when I look at the fitted line it can twist in more ways than just its slope. If I give a model 20 coefficients, it has 20 axes to twist a line to fit 20 data points (0 degrees of freedom, i.e., as many data points as coefficients).

Look at the RMSE (Root Mean Squared Error) of the large model: RMSE = 0. What does this mean? There is no unexplained error in the data; the model can account for everything happening there. But if data = structure + noise, how can a model have zero error when there is noise? It has overfit the data, it has overexplained them, i.e., it has tried to follow the noise around. Overfit models are useless for two reasons: they cannot predict, and they mislead you with their statistics.
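
To see this zero-degrees-of-freedom behaviour concretely, here is a minimal sketch in R (my own simulated data, not the series behind the output above): an intercept plus 19 random predictors gives 20 coefficients for 20 observations, and the fit reproduces the data exactly.

```r
set.seed(1)

# 20 observations that are nothing but a constant plus noise
y <- 10 + rnorm(20)

# 19 completely random predictors; with the intercept that makes
# 20 coefficients for 20 data points, i.e., 0 degrees of freedom
X <- data.frame(matrix(rnorm(20 * 19), nrow = 20))

fit <- lm(y ~ ., data = X)

summary(fit)$df[2]            # residual degrees of freedom: 0
max(abs(residuals(fit)))      # residuals are numerically zero
sqrt(mean(residuals(fit)^2))  # so RMSE = 0: the "model" absorbed the noise too
```

The system of 20 equations with 20 unknowns has exactly one solution, so the fitted line passes through every single point, noise included.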

Now let us look at the model that is just a flat line. Standard regression theory says this is a unique line. But the model with all the variables is also a unique line. Which one is correct? What does unique mean in these two cases? Let’s understand why the flat line is where it is.

You probably have seen this plot:

We square the errors, so that we care only about their magnitude, and minimise that. What does this imply? The calculation is \(\sqrt{\frac{1}{20}\sum_{i=1}^{20}(y_i - \beta)^2}\). Let’s plot the result for different values of the coefficient.
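
A sketch of that curve in R, with my own simulated sample (the seed and noise level are arbitrary assumptions, not the series used here):

```r
set.seed(1)
y <- 10 + rnorm(20, sd = 2.5)

# RMSE of a constant forecast, over a grid of candidate values for beta
beta <- seq(5, 15, by = 0.01)
rmse <- sapply(beta, function(b) sqrt(mean((y - b)^2)))

plot(beta, rmse, type = "l", xlab = "beta", ylab = "RMSE")
abline(v = mean(y), col = "red", lty = 2)

# The minimiser of the curve sits (up to grid resolution) at the sample mean
c(minimiser = beta[which.min(rmse)], mean = mean(y))
```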

Let’s validate that the value is the same:

## 
## Call:
## lm(formula = y ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7861 -0.9820  0.4665  1.5796  5.3797 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1459     0.6053   16.76 7.68e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.707 on 19 degrees of freedom

The shape of the error surface is called convex, which tells us that there is one unique minimum point. So as I try different values for the coefficient, I will find one that minimises the average squared error. Why do I insist on the average? Because it is the average error across the 20 observations. Let’s check something else: what is the mean of the observations?

## [1] 10.1459

So the coefficient that we find by minimising the squared error is the average of the data. Here is the mechanism of what is really going on: when we minimise the squared errors we estimate a conditional mean. The mean is conditioned on whatever inputs the model has, the x’s. This model has just a constant (we multiply the coefficient with x = {1, 1, …, 1}), so it can find only a single number. That single number will always be the sample average. If we work through the equations we will find that the minimum of squared errors is the mean. Let’s do this, it is only a couple of lines.

This is the loss function, the MSE if we allow changes in the constant \(c\). \[J(c) = \frac{1}{n} \sum_{i=1}^{n} (x_i - c)^2\] To find the minimum we take its derivative with respect to \(c\), the parameter we can change (we want to see when the rate of change of the error surface becomes zero, i.e., when the convex surface has reached the flat minimum): \[\frac{d}{dc} J(c) = \frac{1}{n} \sum_{i=1}^{n} 2(x_i - c) \cdot (-1)\] We want this to be equal to zero: \[-\frac{2}{n} \sum_{i=1}^{n} (x_i - c) = 0\] Get rid of the \(-2/n\) by dividing both sides by it: \[\sum_{i=1}^{n} (x_i - c) = 0\] Expand: \[\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} c = 0\] The summation of a constant is just that constant as many times as you sum it, so if we sum up to \(n\) we get \(n \times c\): \[nc = \sum_{i=1}^{n} x_i\] We’re there: \[c = \frac{1}{n} \sum_{i=1}^{n} x_i\] Whenever you minimise squared errors you will end up with the conditional mean.

Just for context, if you were to minimise the absolute errors, you would get the median. Squared errors are sensitive to outliers, just like the mean, because these two are connected. Absolute errors are robust against outliers, just like the median, because these two are connected.
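
This mean/median connection is easy to check numerically. A sketch with a skewed simulated sample (my own toy data), using base R’s `optimize()` to minimise each loss over a constant:

```r
set.seed(42)
# A skewed sample, so that the mean and the median differ visibly
y <- rexp(201, rate = 0.5)

# Minimising the mean squared error over a constant gives the mean...
c_sq  <- optimize(function(c) mean((y - c)^2), range(y))$minimum
# ...while minimising the mean absolute error gives the median
c_abs <- optimize(function(c) mean(abs(y - c)), range(y))$minimum

c(squared_loss  = c_sq,  sample_mean   = mean(y))
c(absolute_loss = c_abs, sample_median = median(y))
```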

So that unique line we get from the regression is the “average” best line. Let’s go one step deeper.

We plot the regression for 5, 10, 15, 20 observations: the value changes, which is not unexpected. But here is the harder question. All these lines are unique optimal lines. Which of them is the optimal optimal?

First, let’s make a connection with what we said before. Recall the standard error of a coefficient? That is a standard deviation. I can approximate it by calculating multiple coefficients, as I did above, and getting a distribution. The standard deviation of that distribution will approximate that number: how uncertain we are about the value we give to a coefficient. (Strictly speaking, this is wrong; we would have to resample 20 observations every time and look at the standard deviation of that distribution… which is connected to the noise, as the 20 observations will differ due to noise.)

None of these lines is optimal. All are conditional on the sample size. If I increase the sample by one observation, I’ll get a different number. As the sample keeps increasing, these differences will become minimal, so we will have converged to a reliable number. Do we know if the number we have for 20 observations is reliable? Actually, yes; that is the use of the standard error, the t-statistic, and the p-value. They all tell us whether the estimation is reliable, not whether that variable should be in the model or not.

Now, let’s return to our three regressions with the different numbers of variables. Look at the RMSE there: the fewer the variables, the higher it is; the more the variables, the smaller it is. (We already said that a model with more parameters can fit the data more easily, so this is expected.) If we try to build a model to minimise the error, we will end up with the largest (overfit) model, which is useless. Following the \(R^2\) gives exactly the same result. Likewise, the model with just the constant may be too simplistic and may not follow the structure of the data. So it does not only filter out the noise, but it filters out the structure as well. We call these models underfit, or biased.

For a modeller the question is which of these three regressions is best (well, we know one is full of random variables, so surely not that one!). Before we answer that, let’s do another small experiment: repeat the example with the changing sample size, for the regression with the x’s.

Notice how these coefficients change more dramatically than in the simpler model with just a constant. Let’s plot that.

Think of the two models in terms of the total variance coming from their coefficients. The model with more parameters will tend to have more variability. This is another fundamental thing in statistical modelling: model variance increases with model complexity. The model with more variance is less stable as the data change. This does not have to be some silly academic way of changing the data; it could be something as simple as getting the new sales figures, or whatever you are modelling. (I have made a few simplifications here. First, we would not vary the sample size, but we would resample the same sample size. (Can you think why changing the sample size is a bad idea to get comparable numbers?) Second, model variance is the (conditional) variance of the estimator of \(y_t\), the \(\hat{y}_t\), and not of the parameters, but for linear models there is an analytical connection between the two. The answer to my question here is: the degrees of freedom change, and as we saw, as I run out of degrees of freedom, my model statistics become unreliable, so I prefer to keep the sample size constant to keep the quality of the model estimation constant as well, so that any variability is due to model complexity and data noise.)

When we think of models we have two fundamental quantities: bias (think of it as structural bias; is my model too simple to explain the data?) and variance (is my model too volatile in its estimation because of its complexity?). These two are connected through the bias-variance trade-off. Here I will spare you the maths - it is not difficult, but the notation is a mess: Wikipedia is one click away, and we can go through it together.

\[ Predictive\ MSE\ (Mean\ Squared\ Error) = Bias^2 + Variance + Irreducible\ Error \]

The irreducible error is the level of noise, so we can pretend it is not there, and think the equation as excess predictive error (over noise) due to our modelling.

\[ Excess\ Error = Bias^2 + Variance \]

How do these two things move? The examples above suggest that Bias decreases with complexity (the model has more capacity to learn stuff), and the variance increases with complexity (the model has many parameters and becomes jittery). Below is a simplified schematic of these two.

This is the fundamental challenge in selecting the correct model or the correct variables. Every model is somewhere on the red line. This red line is unknown for the future data, for which we need a prediction. So the challenge is how to pick the model that balances bias and variance, so that it does not overfit or underfit, but fits just about right to the data. This model will approximate the underlying structure accurately and provide reliable predictions.

Before we proceed to show how to resolve this, let me demonstrate why overfitting gives terrible predictions.

The data contain no noise, so any issues come directly from how appropriate the model is. A polynomial cannot fit a sine wave when its inputs are just time (\(x=1,2,3,4,5,\ldots\)). If we increase the order of the polynomial it looks as if it is doing a good job, judging from the in-sample fit. These are overfit results: they pretend to have modelled something, but in fact they have just twisted around the data points thanks to their many parameters. They have not learned any of the structure of the data.
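
As a sketch (my own noise-free sine wave and arbitrary polynomial orders, not the exact example plotted above), increasing the order drives the in-sample RMSE down while the out-of-sample RMSE explodes:

```r
# Noise-free sine wave: 20 points to fit, 10 future points to predict
t_in  <- 1:20
t_out <- 21:30
y_in  <- sin(2 * pi * t_in / 12)
y_out <- sin(2 * pi * t_out / 12)

orders <- c(3, 10, 15)
res <- sapply(orders, function(p) {
  fit <- lm(y ~ poly(t, p), data = data.frame(t = t_in, y = y_in))
  c(in_rmse  = sqrt(mean(residuals(fit)^2)),
    out_rmse = sqrt(mean((y_out - predict(fit, data.frame(t = t_out)))^2)))
})
colnames(res) <- paste("order", orders)
res  # in-sample error falls with the order; out-of-sample error blows up
```

The high-order polynomial twists through the in-sample points, then shoots off as soon as it has to extrapolate.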

The majority of variable and model selection approaches operate on the principle: \[ Best\ Model = Best\ Data\ Fit + Penalty\ for\ Complexity\] The different approaches effectively just pick another penalty. Adjusted \(R^2\) is one of those approaches. Its issue is that the penalty it uses is nowadays considered to be miscalculated. (I am oversimplifying a bit! It does not explicitly separate the bias and variance, and hopes to achieve that through the degrees of freedom. That does not work though.)

There are three widely used approaches that are still valid in 2026 and relevant to the types of models you will face:

All three are connected. Regression is the easy use case, but more advanced models like ARIMA, GARCH, etc., are typically specified using these tools.

Variable (model) selection

Since you are aware of multiple regression, I assume you have seen standard regression problems. Since you now have to deal with time series, let’s focus on dynamic regression, i.e., a regression that contains time dynamics (how yesterday affects today, and so on). What we go through here is also the basis of ARIMA models.

The first component we need to understand is lags. Let \(y_t\) be the observation at period \(t\). The convention is that \(t\) is now. The previous period is \(t-1\), the next period is \(t+1\), and so on. If your data are daily, then these would be yesterday and tomorrow. The data could be monthly, and then they would be the previous and next months, and so on. How does this look in terms of data? When we build regression models we need to organise the data in columns. All data in a row correspond to a single observation. The same idea applies here.

We create a monthly time series. This is placed in the first column, and from that we create 5 lags. These are placed in the columns next to it.

##            Yt  Yt-1  Yt-2  Yt-3  Yt-4  Yt-5
## 2023-02 98.11    NA    NA    NA    NA    NA
## 2023-03 96.24 98.11    NA    NA    NA    NA
## 2023-04 91.45 96.24 98.11    NA    NA    NA
## 2023-05 70.67 91.45 96.24 98.11    NA    NA
## 2023-06 79.21 70.67 91.45 96.24 98.11    NA
## 2023-07 75.61 79.21 70.67 91.45 96.24 98.11
## 2023-08 87.22 75.61 79.21 70.67 91.45 96.24
## 2023-09 85.27 87.22 75.61 79.21 70.67 91.45

Each row corresponds to a month. When a regression reads the data, it reads them row by row. Let’s look at the second row (2023-03). The current value is 96.24. The value for t-1, the previous month, is 98.11, which is taken from the time series at date 2023-02. The rest of the lags are empty, because we do not have data that far in the past to populate them. In the third row I have a t-2 data point from 2023-02, so I can add it there, and so on.
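
Here is how such a lag matrix can be built in R. This is a sketch with simulated values (the numbers will not match the table above, only the structure will):

```r
set.seed(1)
# A monthly series starting February 2023, like the table above
y <- ts(rnorm(36, mean = 100, sd = 10), start = c(2023, 2), frequency = 12)

# Column k holds the series shifted down by k rows: the value k periods ago
n    <- length(y)
lags <- 5
X <- sapply(0:lags, function(k) c(rep(NA, k), head(as.numeric(y), n - k)))
colnames(X) <- c("Yt", paste0("Yt-", 1:lags))

head(X)              # the first rows carry NAs, just like the table
X <- X[-(1:lags), ]  # good practice: trim the incomplete rows
head(X)
```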

Let’s look at the last rows of the table.

##             Yt   Yt-1   Yt-2   Yt-3   Yt-4   Yt-5
## 2025-06 109.18 127.34 117.66 115.11 125.83 120.97
## 2025-07 110.80 109.18 127.34 117.66 115.11 125.83
## 2025-08 103.61 110.80 109.18 127.34 117.66 115.11
## 2025-09 106.93 103.61 110.80 109.18 127.34 117.66
## 2025-10 105.32 106.93 103.61 110.80 109.18 127.34
## 2025-11 115.16 105.32 106.93 103.61 110.80 109.18
## 2025-12 113.06 115.16 105.32 106.93 103.61 110.80
## 2026-01  99.34 113.06 115.16 105.32 106.93 103.61

Look at the last row. The values in there are the values we would get if we read the 5 previous values from the column “Yt”, i.e., from 2025-12 back to 2025-08. It is good practice to trim the rows that are not complete. Mathematically, they do not exist, as we need complete variables (each column). So the beginning of our matrix becomes (I just removed the top 5 rows):

##            Yt  Yt-1  Yt-2  Yt-3  Yt-4  Yt-5
## 2023-07 75.61 79.21 70.67 91.45 96.24 98.11
## 2023-08 87.22 75.61 79.21 70.67 91.45 96.24
## 2023-09 85.27 87.22 75.61 79.21 70.67 91.45
## 2023-10 87.47 85.27 87.22 75.61 79.21 70.67
## 2023-11 84.74 87.47 85.27 87.22 75.61 79.21
## 2023-12 89.18 84.74 87.47 85.27 87.22 75.61
## 2024-01 90.36 89.18 84.74 87.47 85.27 87.22
## 2024-02 68.51 90.36 89.18 84.74 87.47 85.27

We want to build a regression. First we plot the data. I will look at two plots: what our series looks like, and scatter plots between the target “Yt” and the lags.

Nothing special is happening with the data (trend, seasonality, outliers, structural breaks) - we have not talked about any of these yet. Here are the scatter plots:

Notice the correlations. If I just take those and plot them as a bar chart, I get the plot of the autocorrelation function. “Auto” + correlation, where “auto” means “self” in Greek, so correlation with itself (not exactly, but your Greek lesson can wait!). We rarely write the full AutoCorrelation Function, but rather ACF.

The first bar is correlation of “Yt” to “Yt”, so it is always 1. The second bar is “Yt” to “Yt-1”. The number from the scatter plots should match the size and direction of the bar in the ACF. The horizontal blue lines are confidence intervals. For bars outside the intervals, there is significant evidence that their values are not (just) due to noise.

Let’s look at the correlation between “Yt-1” and the rest. That they are correlated is not a surprise. If Yt and Yt-1, and Yt and Yt-2, are correlated as pairs, so will be Yt-1 and Yt-2. We want an idea of the clean information that Yt-2 carries for Yt, without the impact of Yt-1. This is the so-called partial autocorrelation (PACF). This is how it looks for our data.

The first bar is the PACF value for Yt-1, then Yt-2, and so on. Notice three things: first, the values are different; second, the value for Yt-1 is the same in the ACF and the PACF, as there is no lag before it to clean up; third, in the ACF we see an almost sinusoidal wave across the bars, which is absent in the PACF. What is this wave? The correlations between lags that we plotted before. The ACF changes in a smoother way because of these correlations. The PACF has cleaned up that information, so that wave is no longer there.
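
In R, both plots are one call away (`acf()` and `pacf()` in the base stats package). A sketch on a simulated AR(2) series, my own toy example rather than the data above:

```r
set.seed(1)
# Simulate an AR(2) process: only the first two lags carry clean information
y <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 200)

op <- par(mfrow = c(1, 2))
acf(y,  main = "ACF")   # decays smoothly: the propagated correlations
pacf(y, main = "PACF")  # cleaned up: lags beyond 2 should be inside the bands
par(op)

# The lag-1 values coincide: there is no earlier lag to clean up
c(acf_lag1  = acf(y, plot = FALSE)$acf[2],   # index 1 is lag 0
  pacf_lag1 = pacf(y, plot = FALSE)$acf[1])
```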

(How is the PACF calculated? We build a regression up to the previous lag and lock the coefficients. Then we estimate the coefficient for the current lag. To go to the next lag, we lock that coefficient and estimate the next one, and so on. What does this do? It tells the regression: do not touch any of the contributions of the previous lags, just estimate the on-top contribution of the new lag, without any redundancies.)

When we build a regression, to identify how many lags to include we care about the PACF. Regression does not like redundant information, because it leads to multicollinearity. The PACF has cleaned up that information, so we only find lags that have a contribution of their own. (This is very relevant for ARIMA, as this is the AR part of ARIMA.)

If we were to build a moving average (I have not introduced what this is, but think of it as a short average that rolls over the data) I would not worry about redundancies; if anything, I would want to consider them, as the function of the moving average is to deal with the randomness in the data. Let me make this statement a bit clearer. Suppose we have a constant + noise. As the noise is random, sometimes it will be over the constant and sometimes under it. If I want to cancel the noise, all I need is to take an average. The noise over the constant will add to the (negative) noise under the constant and cancel out, leaving us with just the constant. (Recall, this is what regression is trying to do by minimising the squared errors: estimate a conditional mean (conditional on the x’s) to remove the impact of noise.) The moving average works on the same principle. It says: if I take a local average of, for example, 5 periods, I will cancel some randomness. If I take a longer average I will cancel some more randomness, but there is a point where I cancel too much (we will see this in detail with ARIMA models). The ACF would inform the average of how many periods to take to cancel that randomness.
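
A sketch of that cancellation in R (my own constant-plus-noise series; the window lengths are arbitrary choices):

```r
set.seed(1)
# Constant + noise: the only structure is the value 10
y <- 10 + rnorm(300, sd = 2)

# A centred moving average of length k, via stats::filter
ma <- function(x, k) stats::filter(x, rep(1 / k, k), sides = 2)

# The longer the window, the more of the noise cancels out
c(raw  = sd(y - 10),
  ma5  = sd(ma(y, 5)  - 10, na.rm = TRUE),
  ma21 = sd(ma(y, 21) - 10, na.rm = TRUE))
```

The remaining deviation from the constant shrinks as the window grows, which is exactly the noise being averaged away.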

Both the ACF and PACF are tools that are nowadays considered somewhat obsolete, as they can only help you identify simple data generating processes, and real data are anything but simple. The logic behind them is quite useful though, as you can see most of the time series models, from simple averages to very complex “AI” models as a combination of autoregressive and moving average components.

Back to our regression! Our question was variable selection. I got 5 lags, which one to choose? Let’s use standard regression to decide.

## 
## Call:
## lm(formula = Yt ~ ., data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9787  -5.4014  -0.7869   4.3227  18.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 25.34682   14.68609   1.726   0.0967 .
## Yt.1         0.35347    0.20263   1.744   0.0934 .
## Yt.2         0.33628    0.20066   1.676   0.1062  
## Yt.3         0.04011    0.21421   0.187   0.8530  
## Yt.4         0.14433    0.20240   0.713   0.4824  
## Yt.5        -0.11502    0.18480  -0.622   0.5393  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.23 on 25 degrees of freedom
## Multiple R-squared:  0.5663, Adjusted R-squared:  0.4795 
## F-statistic: 6.527 on 5 and 25 DF,  p-value: 0.0005188

If we wanted to follow the old-style approach, we would reject insignificant variables and build on significant ones (this is wrong! we do it here only as an example):

## 
## Call:
## lm(formula = as.formula(Yt ~ Yt.1), data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8380  -6.7899  -0.6929   5.5511  19.8423 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.891     13.251   2.633   0.0134 *  
## Yt.1           0.658      0.131   5.024 2.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.54 on 29 degrees of freedom
## Multiple R-squared:  0.4653, Adjusted R-squared:  0.4469 
## F-statistic: 25.24 on 1 and 29 DF,  p-value: 2.376e-05

Both \(R^2\) and Adjusted \(R^2\) went down. What is happening here?

You may have seen the stepwise function.

## Start:  AIC=149.48
## Yt ~ Yt.1 + Yt.2 + Yt.3 + Yt.4 + Yt.5
## 
##        Df Sum of Sq    RSS    AIC
## - Yt.3  1      3.67 2617.9 147.52
## - Yt.5  1     40.51 2654.8 147.95
## - Yt.4  1     53.18 2667.4 148.10
## <none>              2614.3 149.48
## - Yt.2  1    293.68 2907.9 150.78
## - Yt.1  1    318.19 2932.4 151.04
## 
## Step:  AIC=147.52
## Yt ~ Yt.1 + Yt.2 + Yt.4 + Yt.5
## 
##        Df Sum of Sq    RSS    AIC
## - Yt.5  1     36.85 2654.8 145.95
## - Yt.4  1     62.82 2680.8 146.26
## <none>              2617.9 147.52
## - Yt.2  1    320.60 2938.5 149.10
## - Yt.1  1    388.75 3006.7 149.81
## 
## Step:  AIC=145.95
## Yt ~ Yt.1 + Yt.2 + Yt.4
## 
##        Df Sum of Sq    RSS    AIC
## - Yt.4  1     31.45 2686.2 144.32
## <none>              2654.8 145.95
## - Yt.2  1    324.85 2979.6 147.53
## - Yt.1  1    355.46 3010.2 147.85
## 
## Step:  AIC=144.32
## Yt ~ Yt.1 + Yt.2
## 
##        Df Sum of Sq    RSS    AIC
## <none>              2686.2 144.32
## - Yt.1  1    403.44 3089.7 146.66
## - Yt.2  1    536.47 3222.7 147.96
## 
## Call:
## lm(formula = Yt ~ Yt.1 + Yt.2, data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9634  -4.8274  -0.9334   4.6612  16.8083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  25.7185    12.9082   1.992   0.0562 .
## Yt.1          0.3595     0.1753   2.051   0.0498 *
## Yt.2          0.3955     0.1672   2.365   0.0252 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.795 on 28 degrees of freedom
## Multiple R-squared:  0.5543, Adjusted R-squared:  0.5225 
## F-statistic: 17.41 on 2 and 28 DF,  p-value: 1.22e-05

This starts from a full model (all available variables are used), and then tries to see if removing a variable improves the model. What score does it look at to see if it improves? Note that it reports the AIC (Akaike Information Criterion). This is what it tries to optimise; not the significance (p-values), \(R^2\), or Adjusted \(R^2\). Let’s follow the transcript. It changes only one variable at a time. For each action it reports what the resulting AIC would be, and it always picks the one that minimises it.
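
The transcript above is the kind of output R’s `step()` produces. A sketch of the call on simulated data (my own AR(2)-style series and lag matrix, so the numbers will differ from the output above):

```r
set.seed(1)
# Simulate a series where only the first two lags matter
n <- 60
e <- rnorm(n, sd = 10)
y <- numeric(n); y[1:2] <- 100
for (t in 3:n) y[t] <- 25 + 0.35 * y[t - 1] + 0.35 * y[t - 2] + e[t]

# Lag matrix with 5 lags, incomplete rows trimmed
X <- data.frame(Yt   = y[6:n],
                Yt.1 = y[5:(n - 1)], Yt.2 = y[4:(n - 2)],
                Yt.3 = y[3:(n - 3)], Yt.4 = y[2:(n - 4)],
                Yt.5 = y[1:(n - 5)])

# Backward stepwise: start full, drop one variable at a time, minimise AIC
fit_full <- lm(Yt ~ ., data = X)
fit_aic  <- step(fit_full, direction = "backward", trace = FALSE)

formula(fit_aic)              # the surviving lags
AIC(fit_aic) - AIC(fit_full)  # never positive: a step is accepted only if AIC drops
```

Setting `trace = FALSE` suppresses the transcript; leave the default to see the same step-by-step log as above.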

What exactly is AIC?

\[AIC = -2log(\mathcal{L}) + 2k\]

where \(\mathcal{L}\) is what we call the likelihood and \(k\) is the number of model parameters (its complexity). I think the next few lines may go a bit too deep, but follow through if you want to know what exactly the likelihood is. Imagine I have a simple model \(y_t = \beta_0 + \beta_1 x_t + \varepsilon_t\). I need to estimate its parameters \(\beta_0\), \(\beta_1\), and the variance of the noise, a \(\sigma_{\varepsilon}\) for \(\varepsilon_t\). If I need a variance, I am already starting to make some assumptions about the distribution of the noise. Recall the example of ice cream sales: the normal distribution is very often a reasonable assumption. To find the parameters of the model, just like with regression, we will minimise the squared errors \(e_t^2\), but now under the condition that the errors are distributed normally. We do that because we want to use that information to get a good estimate of the variance. If we did not assume that, estimating the variance would be much more complicated. The word likelihood implies a probability. What is the probability that, for example, \(\beta_0 = 3\) would be true when I have seen data like \(y_t\)? If I want to combine probabilities, I multiply them together. So I would ask what is the probability for \(y_1\), \(y_2\), and so on, and then take the product of these probabilities. Generally, working with products is difficult; we would much rather work with sums. The way to transform a product into a sum is to take its logarithm (we will return to this, as it is quite important for your course). The likelihood includes a product of probabilities; the log-likelihood includes a sum. Here is the formula of the likelihood assuming normally distributed errors: \[\mathcal{L}(\boldsymbol{\theta}, \sigma^2 | \boldsymbol{y}) = \prod_{t=1}^T \frac{1}{\sqrt{ 2 \pi \sigma^2 }} \exp \left( - \frac{e_t^2}{2 \sigma^2}\right)\] Let’s see what is important here.
The \(\boldsymbol{\theta}\) is a vector of all the parameters, and \(\sigma^2\) is the variance of the noise. The \(\frac{1}{\sqrt{ 2 \pi \sigma^2 }} \exp \left( - \frac{something}{2 \sigma^2}\right)\) comes directly from the equation of the normal distribution, where “something” has zero mean and is normalised by \(\sigma^2\). So the only “moving part” in the likelihood is the errors (\(e_t^2\)). If I ignore the decoration, I have a product over the squared errors; if I take the log of that, I have a sum of squared errors. This we need to minimise. Look at the AIC: it has a minus in front of \(log(\mathcal{L})\). This minus turns the maximisation of the likelihood into a minimisation; it reverses the direction of where the optimal value lies. So when we maximise the likelihood of a model, as long as we have normally distributed errors (and some additional weaker assumptions), we could just as well minimise the MSE (Mean Squared Error) instead.

We can simplify the AIC to make clearer what it does. Substituting the normal likelihood and its variance estimate, it becomes (up to a constant): \[AIC = n\log(MSE) + 2k\] and we want this to be as small as possible (it does not stop at zero, it can go negative). Or in human terms: \[AIC = Goodness\ of\ fit + Penalty\ for\ complexity\]

This is trying to emulate the bias-variance discussion from before. This is exactly what AIC does. It asks you to minimise the MSE (maximise goodness of fit), i.e., reduce the bias of the model and inadvertently increase its variance. As you allow the model to have more parameters, i.e., more variables (explanatory, lags, indicators, whatnot), its complexity increases, and so does the penalty. AIC attempts to stop you when the increase in variance becomes so large that the total error starts increasing again.

(There are other information criteria, like the BIC, and their only difference is a stronger or weaker penalty for complexity. For instance, BIC has a stronger penalty, resulting in smaller models. In practice, it does not matter that much which information criterion you use. People tend to get dogmatic about it, but the evidence does not support that: it does not matter that much, and actually in 2026 we have even better tools - teaser for shrinkage estimators!)

I have not discussed the 2’s in the formula. You could very well remove them and you would get the same answer. They are there due to the derivation of the AIC, to keep the maths intact.
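
We can verify the formula against R’s own `AIC()`: it is exactly \(-2log(\mathcal{L}) + 2k\), with R counting the noise variance as one of the \(k\) parameters. A small sketch on toy data of my own:

```r
set.seed(1)
fit <- lm(y ~ x, data = data.frame(x = rnorm(30), y = rnorm(30)))

ll <- logLik(fit)
k  <- attr(ll, "df")  # 3 here: intercept, slope, and the error variance

c(manual = -2 * as.numeric(ll) + 2 * k, builtin = AIC(fit))  # identical
```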

Long story short: information criteria, like the AIC, are the accepted way to choose variables for a model, or between different models, as long as their likelihoods are comparable. We will return to this statement when we look at choosing between models.

We identified the variables for the regression using a stepwise process. Stepwise will typically be okay, but it is not guaranteed to give you the best AIC model. Furthermore, the stepwise (or globally) best AIC model can contain statistically insignificant variables. This is fine; p-values do not tell us anything about whether a variable should be in the model or not, just whether its coefficient is well estimated. AIC, on the other hand, tries to achieve the best expected error (for unseen data) by balancing the bias and the variance. Simply put, AIC tries to optimise the predictive abilities of a model, not how well it fits in-sample (which is what p-values and \(R^2\) do). An AIC-optimal model should be resistant to both over- and under-fitting. Let’s understand why. If the model is underfit, the goodness-of-fit part is poor (the likelihood is low, so \(-2log(\mathcal{L})\) is large, which is bad, as we want the AIC to be as small as possible). If the model overfits, the penalty is too large.

When we get to ARIMA we will see that it is very effective to choose the order (how many parameters) of the autoregressive and moving average parts using the AIC, so all this will come together when we have the full model. For example, the stepwise above selected to keep the first two lags, so we have a 2nd-order autoregressive part (AR) for the ARIMA.

If it is unclear: AIC does not care what your variables are. In your matrix of x’s you can have autoregressions, external variables, autoregressions of the external variables, indicators (dummy variables), past errors, etc. AIC will go through them one by one and pick the best subset of variables to maximise the predictive performance of the model, according to its embedded penalty (what I am alluding to here is that this penalty may be inappropriate in some cases).

What about very highly correlated variables (multicollinearity)? Can AIC deal with those? Technically, this is not an issue for the AIC, but for the stepwise process of selecting variables. With linear models multicollinearity is always an issue, and we cannot fully automate its treatment (we can get pretty close with shrinkage estimators, which we have not seen yet). However, a more pressing issue is when that high correlation between variables is spurious. This is where stationarity comes into place.

Stationarity - the I in the ARIMA

When we say that a time series is stationary, we are saying that its mean at different points in time is the same. If there was some trend, the mean would change over time and the series would not be stationary. Weak (covariance) stationarity asks for a constant mean and a constant variance over time (homoscedasticity), with the autocovariance depending only on the lag. Strict (strong) stationarity asks for more: if we see the time series as a distribution by stripping the time dimension away, it requires stationarity in all the moments of that distribution, like skewness, kurtosis, and so on.

First we focus on level-stationarity (the mean). We will look at variance stationarity together with GARCH models; the practical implication there is that your prediction intervals (your future variance) have some modellable structure. This can be very useful for portfolio analysis, and so on.

What is a trend? It could be what we call a deterministic trend, a straight line, or a stochastic one: there is a tendency to go up or down, but no consistency. Another way to think of it is that the trend evolves over time. Here are some examples:

All these are examples of non-stationary time series. If we were to imagine calculating a rolling average over the data, its value would change across time. There is no consistent mean, so the series are not stationary. (There are formal statistical tests for that, and their logic is exactly that, calculate many local means, and see if they are statistically different or not.)
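
The local-means logic behind those tests is easy to sketch in R (simulated series of my own; a random walk stands in for a stochastic trend):

```r
set.seed(1)
n <- 200
stationary  <- 10 + rnorm(n)      # noise around a constant mean
random_walk <- cumsum(rnorm(n))   # a stochastic trend: the mean wanders

# Compare the local mean at the start and at the end of each series
local_means <- function(x) c(start = mean(head(x, 50)), end = mean(tail(x, 50)))
local_means(stationary)    # roughly the same
local_means(random_walk)   # typically far apart
```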

There are two issues when we model data that are non-stationary. First, it can lead us to spurious results. Second, it introduces a lot of issues with the estimation of model parameters (and therefore with variable selection, and uncertainty/variance estimation).

Correlation can be understood as co-movement. If two variables move together, then they are correlated (it does not have to be in the same direction; a negative correlation makes them move in opposite directions, but they still move together). Below I plot the Swedish per capita GDP and my cumulative ice cream consumption since the year 2000.

They both co-move. They are highly correlated. One could build a regression here, and the regression would indicate a valid model:

## 
## Call:
## lm(formula = sweGDPpp ~ iceCream)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3703.0 -2107.0  -254.3  2414.7  4320.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29249.705    951.855   30.73   <2e-16 ***
## iceCream       77.694      3.311   23.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2650 on 24 degrees of freedom
## Multiple R-squared:  0.9582, Adjusted R-squared:  0.9565 
## F-statistic: 550.7 on 1 and 24 DF,  p-value: < 2.2e-16

Obviously this is not correct. Swedish GDP is causally connected with my kanelbullar consumption; it would be funny if it were the ice cream.

We can show the same thing with seasonality. I can take any seasonal time series and show a connection with a sine wave of the appropriate frequency. All these connections are spurious.

A way to break the spurious connection is to ask: do they change in the same way? Now I want to model the rate of change of each time series.

And now the regression is

## 
## Call:
## lm(formula = diff(sweGDPpp) ~ diff(iceCream))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3616.5  -554.3   -45.7   369.6  3276.2 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1539.010    617.630   2.492   0.0204 *
## diff(iceCream)    9.776     28.163   0.347   0.7316  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1483 on 23 degrees of freedom
## Multiple R-squared:  0.005212,   Adjusted R-squared:  -0.03804 
## F-statistic: 0.1205 on 1 and 23 DF,  p-value: 0.7316

or better yet, ask AIC:

## Start:  AIC=367.01
## diff(sweGDPpp) ~ diff(iceCream)
## 
##                  Df Sum of Sq      RSS    AIC
## - diff(iceCream)  1    265069 50859268 365.14
## <none>                        50594199 367.01
## 
## Step:  AIC=365.14
## diff(sweGDPpp) ~ 1
## 
## Call:
## lm(formula = diff(sweGDPpp) ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3633.5  -729.6   -53.9   418.1  3381.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1727.1      291.1   5.932 4.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1456 on 24 degrees of freedom

Reassuringly, the regression found that this makes no sense. Modelling things in their rate of change can help break spurious connections. Now let’s look at this from a technical standpoint. Rate of change is what in statistics we call differences: \[z_t = y_t - y_{t-1}\] That is a first-order level difference. Let’s look at what happens with the Swedish GDP when I do that.

The first differences of the Swedish GDP are now stationary; the series fluctuates around a mean (red line). Let’s look at my ice cream consumption:

Observe that the rate of change of my ice cream consumption is still trending upwards. First differences were not sufficient to make this stationary. We can repeat with second-order differences, i.e., take differences of the differences. This is the rate of change of the rate of change of the time series (or the acceleration of the time series - maybe I forgot to tell you, but we’re doing physics as well!).

First we plot it, and then we look at the equations:

Now it is stationary. This is what I should be modelling.
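A minimal Python sketch of the same idea (a simulated linear trend, not the GDP series): differencing removes the trend and leaves a series that fluctuates around a constant mean. Comparing local means, as the stationarity tests mentioned earlier do, now gives consistent answers.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(100)
y = 5 + 2 * t + rng.normal(0, 1, 100)  # trend + noise: non-stationary

z = np.diff(y)  # z_t = y_t - y_{t-1}

# The trend is gone: z fluctuates around the slope of the trend (about 2),
# and the means of its first and second halves now agree.
print(round(z.mean(), 1))
print(round(z[:50].mean() - z[50:].mean(), 1))
```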

Let’s look at the equations. We will introduce some new notation: \[ B(y_t) = y_{t-1}\] The \(B\) is called the backshift operator and all it does is add a lag to \(y_t\) (observe the \(t\) became \(t-1\)). Some books use \(L\) instead, from Lag.

The differences can be expressed as: \[(1-B)y_t = y_t - By_t = y_t - y_{t-1} = z_t\]

Second differences are quite simple: \[(1-B)^2y_t = (1-B)(1-B)y_t = (1-B)(y_t - By_t) = (1-B)(y_t - y_{t-1}) = (1-B)z_t = z_t - z_{t-1}\]
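Multiplying out \((1-B)^2 = 1 - 2B + B^2\) gives \(y_t - 2y_{t-1} + y_{t-2}\); a quick Python check with toy numbers confirms that chaining two first differences is the same thing:

```python
import numpy as np

y = np.array([3.0, 7.0, 12.0, 20.0, 31.0])  # toy series

# Two equivalent ways to get second differences:
chained = np.diff(y, n=2)                # (1-B)(1-B) y_t
expanded = y[2:] - 2 * y[1:-1] + y[:-2]  # y_t - 2 y_{t-1} + y_{t-2}

print(chained)   # [1. 3. 3.]
print(expanded)  # [1. 3. 3.]
```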

I can also have seasonal differences to remove (stochastic) seasonality from my data. Suppose I have daily data (7 days in a week): \[(1-B^7)y_t = y_t - y_{t-7}\]

I can chain as many differences as I want, level or seasonal, as is evident from the second differences.
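A Python sketch of chaining (made-up daily data: a weekly pattern plus an upward drift). The seasonal difference removes the weekly pattern exactly, and a level difference on top removes the remaining drift:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated daily data: a repeating weekly pattern plus an upward drift.
season = np.tile([0.0, 2.0, 5.0, 9.0, 4.0, 1.0, -3.0], 20)
y = season + np.cumsum(rng.normal(0.5, 0.2, 140))

# Seasonal difference (1 - B^7): the weekly pattern cancels exactly.
zs = y[7:] - y[:-7]

# Chain a level difference (1 - B) on top to remove the remaining drift.
z = np.diff(zs)

print(len(y), len(zs), len(z))  # 140 133 132: each difference costs observations
```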

Before we go to the estimation implications, let me make a practical point. If you model the rate of change of the Swedish GDP, but the central bank or government needs the Swedish GDP itself, you need to provide them a prediction without the differences. This is done in the following way. If \(\hat{z}_{t+1}\) is your forecast for the next period, the forecast in \(y_t\) (not in differences) is: \[\hat{y}_{t+1} = y_t + \hat{z}_{t+1}\] The intuition is that since we are modelling the rate of change, our forecast is how the last observed value (\(y_t\)) will change in the next period, and that change is \(\hat{z}_{t+1}\).
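In code, a sketch of this "undifferencing" for several steps ahead (the levels and the \(\hat{z}\) values below are hypothetical numbers, not output of any model): each step adds the next forecast difference to the last known level, so multiple steps ahead accumulate.

```python
import numpy as np

y = np.array([120.0, 124.0, 129.0, 133.0])  # observed levels (made up)
z_hat = np.array([4.5, 4.2, 4.0])           # hypothetical forecast differences

# y_hat_{t+h} = y_t + z_hat_{t+1} + ... + z_hat_{t+h}, hence the cumulative sum.
y_hat = y[-1] + np.cumsum(z_hat)

print(y_hat)  # [137.5 141.7 145.7]
```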

Let’s look at the estimation issues now. I will use the time series we had before for the lag selection, but I will add on top of it a stochastic trend.

We construct the first five lags and repeat the modelling from before. First, let’s look at the ACF and PACF, as these can give us a hint of what lags are important (at least in this simplistic case).

Above are the new ones, below are what we had from the analysis before. Note how the wave in the ACF got “longer”; the trend has imposed stronger correlations, just as the GDP and ice cream appeared correlated due to the trend. As those were spuriously correlated, so are these stronger correlations spurious. Also note that the signal in the PACF (which lags are important) is now masked.
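A quick simulated check of how a trend inflates autocorrelations (a Python sketch; `acf1` is a small helper I define here, computing the lag-1 sample autocorrelation):

```python
import numpy as np

rng = np.random.default_rng(5)
noise = rng.normal(0, 1, 200)
trend = noise + 0.1 * np.arange(200)  # the same noise, plus a trend

def acf1(x):
    """Lag-1 sample autocorrelation (biased estimator, for illustration)."""
    x = x - x.mean()
    return float((x[1:] * x[:-1]).sum() / (x ** 2).sum())

print(round(acf1(noise), 2))  # near zero: white noise has no memory
print(round(acf1(trend), 2))  # near one: the trend inflates the ACF
```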

Let’s estimate a regression on the data with the trend. First without selection.

## 
## Call:
## lm(formula = Yt ~ ., data = Z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.1200  -7.9216   0.8674   6.5696  16.5206 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.444109  13.378857   1.528    0.142
## Yt.1         0.390565   0.229076   1.705    0.104
## Yt.2         0.206660   0.235417   0.878    0.390
## Yt.3        -0.023287   0.241161  -0.097    0.924
## Yt.4         0.327581   0.234578   1.396    0.178
## Yt.5        -0.009545   0.224201  -0.043    0.966
## 
## Residual standard error: 12.23 on 20 degrees of freedom
## Multiple R-squared:  0.8089, Adjusted R-squared:  0.7611 
## F-statistic: 16.93 on 5 and 20 DF,  p-value: 1.385e-06

For comparison, without the trend we had:

## 
## Call:
## lm(formula = Yt ~ ., data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9787  -5.4014  -0.7869   4.3227  18.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 25.34682   14.68609   1.726   0.0967 .
## Yt.1         0.35347    0.20263   1.744   0.0934 .
## Yt.2         0.33628    0.20066   1.676   0.1062  
## Yt.3         0.04011    0.21421   0.187   0.8530  
## Yt.4         0.14433    0.20240   0.713   0.4824  
## Yt.5        -0.11502    0.18480  -0.622   0.5393  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.23 on 25 degrees of freedom
## Multiple R-squared:  0.5663, Adjusted R-squared:  0.4795 
## F-statistic: 6.527 on 5 and 25 DF,  p-value: 0.0005188

Look at how different the coefficients are. The model is quite lost, as the strong autocorrelations confuse it.

Below we ask AIC to recover the correct model:

## Start:  AIC=135.36
## Yt ~ Yt.1 + Yt.2 + Yt.3 + Yt.4 + Yt.5
## 
##        Df Sum of Sq    RSS    AIC
## - Yt.5  1      0.27 2989.3 133.36
## - Yt.3  1      1.39 2990.5 133.37
## - Yt.2  1    115.17 3104.2 134.34
## <none>              2989.1 135.36
## - Yt.4  1    291.45 3280.5 135.78
## - Yt.1  1    434.44 3423.5 136.89
## 
## Step:  AIC=133.36
## Yt ~ Yt.1 + Yt.2 + Yt.3 + Yt.4
## 
##        Df Sum of Sq    RSS    AIC
## - Yt.3  1      1.79 2991.1 131.38
## - Yt.2  1    115.38 3104.7 132.35
## <none>              2989.3 133.36
## - Yt.4  1    344.76 3334.1 134.20
## - Yt.1  1    480.50 3469.8 135.24
## 
## Step:  AIC=131.38
## Yt ~ Yt.1 + Yt.2 + Yt.4
## 
##        Df Sum of Sq    RSS    AIC
## - Yt.2  1    116.87 3108.0 130.38
## <none>              2991.1 131.38
## - Yt.4  1    398.98 3390.1 132.63
## - Yt.1  1    512.84 3504.0 133.49
## 
## Step:  AIC=130.37
## Yt ~ Yt.1 + Yt.4
## 
##        Df Sum of Sq    RSS    AIC
## <none>              3108.0 130.38
## - Yt.4  1     897.2 4005.2 134.97
## - Yt.1  1    1174.3 4282.3 136.71
## 
## Call:
## lm(formula = Yt ~ Yt.1 + Yt.4, data = Z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.299  -7.722   1.463   6.538  17.086 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  22.2210    12.4370   1.787  0.08718 . 
## Yt.1          0.4798     0.1628   2.948  0.00722 **
## Yt.4          0.4013     0.1558   2.577  0.01686 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.62 on 23 degrees of freedom
## Multiple R-squared:  0.8013, Adjusted R-squared:  0.784 
## F-statistic: 46.37 on 2 and 23 DF,  p-value: 8.509e-09

It actually managed, but that is by no means what we would typically expect. The problem gets more complicated if we have to deal with additional types of regressors, e.g., moving average terms, or other explanatory variables. I think it is fair to say that when we mix ARIMA components (so, autoregressions and moving averages) with explanatory variables, we don’t really know how to solve it optimally and consistently, and we kind of wing it. What should we difference? What should we leave as is? (I am exaggerating a bit, but to be able to answer this with statistical theory you need to put up with statistics and me for much longer, as we would need to dive deeper into econometric theory.) Here is a simple heuristic. Make your target variable stationary. For the explanatory variables, think whether it is their rate of change or the series as is that would connect with the stationary target variable. If I am modelling the rate of change of the GDP of Sweden, is it unemployment or the rate of change of unemployment? If unsure, prefer to gamble on more stationary series rather than less. Structural components like seasonality or trend should be modelled explicitly as such, and not through spurious correlations. We will see how easy this is to do with ARIMA (and regression).

At this point we have kind of seen two of the components of ARIMA, the AR and the I. By the way, it is called I because people are pretentious: “I” from integration, as in the continuous time domain, where we have derivatives and integrals. Here is some random trivia. Before I came to Sweden I studied in Lancaster, UK. One of the inventors of ARIMA, Jenkins, was a professor there, long gone before my time, but his PhD students, who actually did the first computer implementations, were still active professors there. You get a very different understanding when you get to talk to the people who actually wrote the thing - not deeper, just realising that everybody is winging it! Their theory of specifying ARIMA models is very elegant mathematically, but nowadays largely considered obsolete, with AIC having dominated. But all that comes when we go into the full ARIMA model.

Logarithmic transformations

One last thing to cover in this document: what logarithms do for us, and why you will face them a lot when modelling financial time series.

Everything we have seen so far is linear, but there is a massive class of nonlinear models. One of the most useful nonlinearities is to model things proportionally. Let me give you an example. Suppose I am running a promotional campaign for my ice cream in Stockholm and in Karlstad. The centre of Stockholm has just under 1 million residents. The city of Karlstad has just under 70 000 people. I can write a model for my sales as: \[ Sales = \beta_0 + \beta_1 \times promotion + autoregressions + \ldots \] Here, let promotion be an indicator variable that is equal to 1 when the promotion is happening, and 0 when it is not. Let’s assume I get a significant uplift in sales due to promotions, say 50%. In Stockholm \(\beta_1 = 500 000\). If I were to apply the same model in Karlstad, I would get silly forecasts, as I would forecast that more people than the full population would buy ice cream. A commendable effort, Karlstadians, but probably a bad forecast. I would like, though, to have a model that can cover more cities, because then I have more data, and then I can estimate my model better (think of the degrees of freedom increasing, giving us better estimates for the coefficients). Instead of modelling the effect as additive (linear), I could model it as multiplicative (nonlinear). Very simplified: \[ Sales = (stuff) \times promotion^{\beta_1} \] This model should estimate the 50% increase and be globally applicable to both Stockholm and Karlstad.

I mentioned when I was going over likelihood that multiplications complicate things for us. Instead of modelling the data in their raw form, we can calculate their logarithms. Remember that \[\log(xy) = \log(x) + \log(y)\] and \[\log(x^p) = p \log(x)\] So their function is to transform multiplicative (proportional) relationships to additive ones. Once we have estimated and built the model in the additive representation, we can take the exponential of the prediction to return it to the original scale.
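A Python sketch tying the last two ideas together (everything simulated: the city names, base levels, noise, and the 50% uplift are made up to mirror the example, and I code the multiplicative effect as \(uplift^{promotion}\) so the 0/1 indicator works). Fitting the model in logs turns it into a simple line, and recovers the same relative uplift regardless of city size:

```python
import numpy as np

rng = np.random.default_rng(3)
promo = rng.integers(0, 2, 200)   # 1 when the (hypothetical) promotion runs

uplifts = []
for base in (1_000_000, 70_000):  # made-up "Stockholm" and "Karlstad" base sales
    # Multiplicative model: a 50% uplift scales sales, whatever the base level.
    sales = base * 1.5 ** promo * np.exp(rng.normal(0, 0.05, 200))
    # In logs the model is linear: log(sales) = log(base) + promo * log(1.5).
    slope = np.polyfit(promo, np.log(sales), 1)[0]
    uplifts.append(np.exp(slope) - 1)  # back-transform the estimated uplift

print([round(u, 2) for u in uplifts])  # roughly 0.5 for both cities
```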

Here is an example of another multiplicative relationship that can be modelled very easily with logarithms. At the beginning we showed the Air Passengers time series. Here it is plotted as is, and in logarithms.

The raw data exhibit what we call multiplicative seasonality. As the trend increases and there are more passengers, the difference between summer and winter passenger traffic grows with the number of passengers. (This is fairly common; human activities are proportional - the more the people, the more variability we see due to seasonality, promotions, disruptions, etc.) Look at the log variant of the time series. The seasonal width has stabilised. Technically, we disconnected it from the amount of passengers. If the model was passengers × seasonality, it now became log(passengers) + log(seasonality), simplifying the modelling. In fact, neither regression nor ARIMA can model the raw AirPassengers data correctly, because of this nonlinearity. So we typically use logarithms or similar transformations.

Let me connect this to financial data. We are often interested in modelling the returns of an asset. A return is how the value has changed from the previous period - the first differences we looked at before. If we think about multiple assets, we are interested in the percentage return, so that we can compare and decide how to invest. If we were looking at the value of an asset in SEK, small gains in expensive assets would look like a lot of SEK, while substantial gains in cheap assets would look insignificant, which is very misleading. It is the size of the investment times the gain in value that matters. If I can buy a ton of the cheap asset and it increases in value substantially, that is a much better option than buying a few units of the expensive asset that increase just a little. We care about percentage growth. This is a proportion, so it maps naturally to logarithmic differences: \[\log(y_t) - \log(y_{t-1}) = \log \left( \frac{y_t}{y_{t-1}} \right) \approx \frac{y_t - y_{t-1}}{y_{t-1}}\]
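A small Python check, with toy prices, that differences of logs approximate percentage returns when the changes are modest:

```python
import numpy as np

price = np.array([100.0, 105.0, 103.0, 110.0])  # toy asset prices

log_ret = np.diff(np.log(price))       # differences of logs
pct_ret = np.diff(price) / price[:-1]  # plain percentage returns

# For small changes the two are nearly identical.
print(np.round(log_ret, 3))  # approx 0.049, -0.019, 0.066
print(np.round(pct_ret, 3))  # approx 0.05, -0.019, 0.068
```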

Final point. If the size of the seasonality changes proportionally, the variability of the time series increases, i.e., the uncertainty increases. Logarithms get everything “under control” by compressing the variability of the time series. Quite often a time series that appears to be heteroscedastic, and is therefore much harder to model, becomes homoscedastic (equal variance across time), and much easier to model, with logarithms.

As an example, I will plot the differences of the Air Passengers without and with logarithms. See how the differences of the log(Air Passengers) have a much better behaved (homoscedastic) variance?

Okay, final final point: we call them log-differences, but what we calculate is differences of logs. The logarithm of zero is minus infinity, and there are no logarithms of negative numbers. As the plots above exemplify, differences will give you negative numbers (when the raw data decrease). So we first take the logarithm and then apply differences.
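The order matters in code too (toy prices again): logs of differences break as soon as the series drops, while differences of logs are always defined for positive data.

```python
import numpy as np

price = np.array([100.0, 105.0, 103.0, 110.0])

ok = np.diff(np.log(price))  # logs first, then differences: always defined

with np.errstate(invalid="ignore"):
    bad = np.log(np.diff(price))  # differences first: log(-2.0) is NaN

print(np.isnan(ok).any())   # False
print(np.isnan(bad).any())  # True
```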

Done!